Snowflake is a SaaS-based Data Warehouse platform built over AWS,Microsoft Azure, and Google Cloud infrastructure. One of the features behind this software’s popularity with businesses worldwide is its scalability, making it cost-effective. The architecture involves virtual compute instances and efficient storage buckets that run solely on the cloud.It serves as a centralized platform for data management, data lakes, data engineering, data applications development, data science, and secure sharing and consumption of real-time and shared data. Providing a centralized system to consolidate all data, Snowflake revolutionized the data warehousing industry. With Snowflake, you can simplify data warehouse management without sacrificing features.
Explain Snowflak Architecture
The Snowflake architecture is a hybrid of shared-disk (A common disk or storage device is shared by all computing nodes) and shared-nothing (Each computing node has a private memory and storage space) database architecture in order to combine the best of both. Snowflake utilizes a central data repository for persistent data, which is available to all compute nodes similar to a shared-disk architecture. But, equally, as with shared-nothing architectures, Snowflake uses massively parallel computing (MPP) clusters for query processing, in which each node stores part of the whole data set locally.
All data in Snowflake tables is automatically divided into micro-partitions, which are contiguous units of storage. Each micro-partition contains between 50 MB and 500 MB of uncompressed data (note that the actual size in Snowflake is smaller because data is always stored compressed).Micro-partitioning is automatically performed on all Snowflake tables. Tables are transparently partitioned using the ordering of the data as it is inserted/loaded.
The Snowflake architecture is divided into three key layers as shown below:
- Database Storage Layer: Once data has been loaded into Snowflake, this layer reorganizes that data into a specific format like columnar, compressed, and optimized format. The optimized data is stored in cloud storage. Snowflake manages all aspects of how this data is stored — the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are not directly visible nor accessible by customers; they are only accessible through SQL query operations run using Snowflake.
- Query Processing Layer: In the processing layer, queries are executed using virtual warehouses. Virtual warehouses are independent MPP (Massively Parallel Processing) compute clusters comprised of multiple compute nodes that Snowflake allocates from cloud providers. Due to the fact that virtual warehouses do not share their compute resources with each other, their performance is independent of each other.
- Cloud Services Layer: It provides services to administer and manage a Snowflake data cloud, such as access control, authentication, metadata management, infrastructure management, query parsing, optimization, and many more.
What do you mean by virtual warehouse?
In Snowflake, a virtual warehouse, often known as a "warehouse," is a collection of computational resources. A virtual warehouse provides the resources required for the users like CPU, memory, and temporary storage to perform multiple Snowflake operations:
Execute the SQL SELECT statements that necessitate the use of computing resources (e.g. retrieving rows from tables and views).
DML operations include:
Updating table rows (DELETE , INSERT , UPDATE).
Data Loading into tables (COPY INTO <table>).
Data unloading from tables (COPY INTO <location>).
Tuning
- Data Loading -: Avoid Scanning External Files,Snowflake supports querying data in both ways: managed and external. Data kept on external storage will have slower performance as compared to data managed by Snowflake internally. So, if we can move data to Snowflake we should consider following supported ways by Snowflake for this.
- Snowpipe – Enables loading data from files as soon as they’re available in a stage. This means you can load data from files in micro-batches, making it available to users within minutes, rather than manually executing COPY statements on a schedule to load larger batches. So, Snowpipe helps in quickly creating Managed Table, and not to be confused with External Table.
As data is loaded by date, it tends to be naturally clustered, with all data for the same day falling into the same micro-partition. However, if the following SQL is executed, Snowflake will attempt to keep all sale dates in the same micro-partition. When needed, a background task will automatically re-cluster the data, and this compute processing will be charged as a separate item.
- The query results from the view don’t change often. This almost always means that the underlying/base table for the view doesn’t change often, or at least that the subset of base table rows used in the materialized view don’t change often.
- The results of the view are used often (typically significantly more often than the query results change).
- The query consumes a lot of resources. Typically, this means that the query consumes a lot of processing time or credits, but it could also mean that the query consumes a lot of storage space for intermediate results.
- The results are not used often (relative to the rate at which the results change).
- The query is not resource intensive so it is not costly to re-run it.
- Restore the data-associated objects that may have lost unintentionally.
- For examining the data utilization and changes done to the data in a specific time period.
- Duplicating and backing up the data from the essential points in history.
No comments:
Post a Comment